Statistics I
Lisbon Accounting and Business School – Polytechnic University of Lisbon
These slides are a free translation and adaptation of the slide deck for Estatística I by Prof. Sandra Custódio and Prof. Teresa Ferreira from the Lisbon Accounting and Business School - Polytechnic University of Lisbon.
Consider the following scenario:
💼 Investor: What’s the probability this startup will succeed?
📊 Analyst: Hard to say—every startup is different.
💼 Investor: But if you had to guess, based on similar cases?
📊 Analyst: Maybe 1 in 3 succeed under these conditions.
💼 Investor: So, would you bet on it?
📊 Analyst: Yes, I would.
💼 Investor: Even if the odds aren’t great?
📊 Analyst: I believe this one has what it takes.
Here we can define probability in terms of frequency of occurrence, i.e. as a percentage of successes in a moderately large number of similar situations.
This is the most natural and traditional way of thinking about probability.
But, what if this company belongs to a completely novel market sector?
There might be situations where the frequency concept is not adequate, because it might refer to a one-time event. These are subjective beliefs.
A company is recruiting a new CEO, and a board member says:
“I believe there’s a 90% chance that our chosen candidate will be an effective CEO.”
It might seem easy to disregard the second case as unscientific or useless. However, people often need to make decisions under uncertainty with insufficient data (or no data at all!) about previous realizations of the specific event.
Beliefs allow the decision maker to, well, make some decision, at least consistently.
Uncertainty
A set is a collection of objects, which are elements of the set.
Definition
Let \(S\) represent a set and \(s\) an element of that set. We write \(s\in S\) to mean that \(s\) belongs to \(S\).
If \(s\) does not belong to \(S\), we write \(s\notin S\).
Definition
If a set \(S\) does not have any element, then it is the empty set, denoted by \(\emptyset\).
There are several ways to specify a set.
By extension, or as a list:
If a set \(S\) has a finite number of elements (\(x_i\in S\)) we can write it like this: \[S=\{x_1, x_2, ..., x_n\}\]
If a set \(S\) has an infinite (but countable) number of elements (\(x_i\in S\)) we can write it like: \[S=\{x_1, x_2, ...\}\]
By describing the property (\(P\)) that \(x\) must satisfy to be included in \(S\): \[S= \{x|x \text{ satisfies } P\}\] in this case \(|\) reads as such that. For example \[S=\{x\in\mathbb{R}|x\geq 0\}\] to describe the non-negative real numbers.
This example is special, as the non-negative real numbers cannot be written down as a list. In this case the interval \([0,\infty)\) is an uncountable set.
Definition
If \(\forall x\in S\) it is also true that \(x\in T\), then we say that \(S\) is a subset of \(T\), and we write it like \(S\subseteq T\).
Definition
If \(S\subseteq T\) and at the same time \(T\subseteq S\) then we say that \(S\) and \(T\) are equal, and we write it \(S=T\).
Definition
The universal set \(\Omega\) is the set that contains all objects that could conceivably be of interest in a particular context.
By definition, any set \(S\) must be a subset of \(\Omega\).
The universal set is important because it defines the scope of our analysis. Say we are studying the performance of students of Statistics I in 2025.
The cars parked outside our institution do not belong to the universal set, because they are not relevant for our purpose. Only students of Statistics I in 2025 belong to the universal set.
Definition
The complement of a set \(S\), with respect to \(\Omega\), is the set \(\{x\in\Omega| x\notin S\}\), that is, all the relevant elements that do not belong in \(S\). We denote it as \(S^c\).
Corollary: It is easy to see that \(\Omega^c=\emptyset\).
Definition
The union of two sets \(S\) and \(T\) is the set of all elements that belong to \(S\) or \(T\) (or both), and is denoted by \(S\cup T\). \[S\cup T=\{x\in\Omega | x\in S\ \vee\ x\in T\}\]
Definition
The intersection of two sets \(S\) and \(T\) is the set of all elements that belong to \(S\) and \(T\), and is denoted by \(S\cap T\). \[S\cap T=\{x\in\Omega | x\in S\wedge x\in T\}\]
Note that \(\vee\) stands for or, and \(\wedge\) stands for and.
Sometimes we might need to consider the union or intersection of many sets, and for that we can use a notation similar to the one we used for summations:
\[\bigcup_{n=1}^\infty S_n = S_1 \cup S_2 \cup ... = \{x\in\Omega | x \in S_n \text{ for some } n\}\]
\[\bigcap_{n=1}^\infty S_n = S_1 \cap S_2 \cap ... = \{x\in\Omega | x \in S_n \text{ for every } n\}\]
Definition
Two sets (say \(S\) and \(T\)) are said to be disjoint if \(S\cap T=\emptyset\).
More generally, a collection of sets \(S_n\) is disjoint if \(S_i\) and \(S_j\) are disjoint when \(i\neq j\).
Definition
A collection of sets is said to be a partition of a set \(S\) if the sets in the collection are:
Disjoint
Their union is \(S\)
We can denote a partition of \(\Omega\) by \(\mathcal{P}(\Omega)\).
The number of elements of a set \(S\) is known as its cardinality and it is denoted as \(\# S\). \(\# S\) satisfies:
If we have two sets, \(S\) and \(T\), we define the set minus operation, \(S\setminus T\), as the set that contains all elements of \(S\) that do not belong to \(T\):
\[S\setminus T = \{x\in S| x\notin T\}\]
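These operations map directly onto Python's built-in `set` type. A minimal sketch, with an arbitrary illustrative universal set \(\Omega=\{1,...,10\}\) and two subsets chosen for this example:

```python
# Illustrative sets (not from the slides): Omega is the universal set.
Omega = set(range(1, 11))   # Ω = {1, ..., 10}
S = {1, 2, 3, 4}
T = {3, 4, 5, 6}

print(S | T)        # union S ∪ T
print(S & T)        # intersection S ∩ T
print(Omega - S)    # complement S^c (with respect to Ω)
print(S - T)        # set minus S \ T
```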
In probability, \(\Omega\), the universal set, is a non-empty set that contains all possible outcomes of an experiment. Each outcome is represented by \(\omega\), and obviously \(\omega\in\Omega\).
The sample space (\(\Omega\)) can be discrete (finite or countably infinite) or continuous (uncountable).
Consider the experiment of throwing a die 🎲 and noting the number shown on side facing upwards.
Consider now the random experiment of measuring the life expectancy of a lamp 💡, measured in hours.
Remember, \(\Omega\) must include all possible outcomes from your experiment! Even the ones that seem ludicrous.
Definition
A subset \(A\) of the sample space \(\Omega\) is called an event. \[A\subseteq \Omega\]
By definition then \(\Omega\) is also an event.
Definition
We say that event \(A\) is realized if, after an experiment, the outcome \(\omega\) obtained satisfies \(\omega \in A\).
Let’s go back to our experiment with the 🎲
The sample space is: \(\Omega=\{1,2,3,4,5,6\}\)
Within this space, we can define the following events:
Now let’s revisit the example of our 💡
The sample space is: \(\Omega=\{x\in\mathbb{R}|x\geq 0\}\)
In this space, we can define the following events:
Definitions
An elementary event is any event that contains a single element (i.e. \(\# A = 1\))
An impossible event is an event with no outcomes (i.e. \(\# A=0\)). As a consequence, an impossible event coincides with the empty set \(\emptyset\).
The certain event is the event \(\Omega\) itself: whatever outcome \(\omega\) we obtain, it belongs to the sample space \(\Omega\) by definition.
Consider two events \(A\) and \(B\) both subsets of \(\Omega\)
\(A^c\) contains all the outcomes that are not in \(A\). \(A^c\) is the event of not \(A\).
If \(A\subseteq B\), then an outcome that realizes event \(A\) (\(\omega\in A\)), also realizes \(B\), as \(A\subseteq B\Rightarrow \omega\in B\) as well. \(A\Rightarrow B\)
For \(A\cup B\) to happen, we need \(\omega \in A\) or \(\omega \in B\), which means that \(A\) happens, or \(B\) happens, or both happen simultaneously.
For \(A\cap B\) to happen, we need \(\omega \in A\) and \(\omega \in B\), which means that \(A\) and \(B\) happen simultaneously.
\(A\) and \(B\) are incompatible if \(A\cap B=\emptyset\), i.e. if an outcome is in one set, it cannot be in the other; for example, \(A\) and \(A^c\) can never happen simultaneously!
Consider two events \(A\) and \(B\) both subsets of \(\Omega\)
Consider the sample space \(\Omega =\{1,2,3,4,5,6\}\), from our 🎲 case.
Define the events: \[A=\{1\},\ B=\{3,6\},\ C=\{2,4,6\},\ D=\{4,5,6\}\]
Let’s define the following events in \(\Omega\)
Besides the concepts of frequency and subjectivity for probability that we already saw, there is an older one, called the “classic” interpretation. It was introduced by Pierre-Simon Laplace in 1812.
Laplace or Classic interpretation of probability
Let \(A\) be an event defined over a finite \(\Omega\). The probability of event \(A\) is defined as:
\[P(A)=\frac{\# A}{\# \Omega}\]
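For a finite \(\Omega\) with equally likely outcomes, the Laplace definition is just a ratio of cardinalities. A minimal sketch for the die example; the events \(A\) (even number) and \(B\) (number greater than 4) are illustrative choices, not taken from the slides:

```python
from fractions import Fraction

# Classic (Laplace) probability: P(A) = #A / #Ω.
Omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # illustrative event: "even number"
B = {5, 6}      # illustrative event: "number greater than 4"

def laplace(event, sample_space):
    """Favourable cases over possible (equally likely) cases."""
    return Fraction(len(event), len(sample_space))

print(laplace(A, Omega))   # 1/2
print(laplace(B, Omega))   # 1/3
```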
Consider now an experiment of throwing two dice 🎲 🎲
The problem with this interpretation is that it cannot be used, or becomes meaningless, when \(\Omega\) is uncountable or infinite. Also, what if the outcomes are not equally likely (i.e. if the dice are not fair)?
The frequency interpretation is still today the dominant interpretation of probability.
In this case, what we want is to observe several independent repetitions of the experiment. After a while, some statistical regularity begins to emerge.
Logically, if you run an experiment and are interested in the probability of event \(A\), then the events you are registering are \(A\) and \(A^c\) (not \(A\)).
Every time you run your experiment, you count when you observe \(A\) and when you observe \(A^c\). Obviously, the total number of experiments is the number of times you observed \(A\) plus the number of times you observed \(A^c\).
Experiment: Draw a random number in the interval \([0,1]\). \(A\) denotes \(x<0.4\).
\(A\) | \(A^c\) | \(N\) | \(f_A\) |
---|---|---|---|
0 | 1 | 1 | 0 |
2 | 8 | 10 | 0.2 |
17 | 33 | 50 | 0.34 |
39 | 61 | 100 | 0.39 |
217 | 283 | 500 | 0.434 |
802 | 1198 | 2000 | 0.401 |
As you can see, the more experiments we run, the more the ratio of occurrences of \(A\) over the total number of experiments stabilizes. More generally:
\[P(A)=\lim_{N\rightarrow \infty}\frac{N_A}{N}\] where \(N_A\) is the number of occurrences of \(A\) in \(N\) experiments.
This is the relative frequency of \(A\) in \(N\) experiments: \(f_A\)
It is not always possible to repeat the experiment that many times under the same conditions.
It seems that the probability that the random number between 0 and 1 is below 0.4 is approximately 40%. The more experiments we run, the closer our relative frequency is to that number.
\[f_A\underset{N \rightarrow \infty}{\rightarrow} 0.4\]
By the way, we will see later that theoretically, indeed \(P(A)=0.4\)
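The simulation behind the table above is easy to reproduce. A minimal sketch (the seed and sample sizes are arbitrary choices, so the exact frequencies will differ from the table):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Draw N uniform numbers in [0, 1] and track the relative
# frequency f_A of the event A = {x < 0.4}.
for N in (10, 100, 2000, 100_000):
    hits = sum(random.random() < 0.4 for _ in range(N))
    print(N, hits / N)
```

As \(N\) grows, the printed relative frequency settles near 0.4, exactly the statistical regularity described above.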
Andrey Kolmogorov defined a set of properties that any probability measure \(P\) should have; these are called Kolmogorov’s axioms (1933):
Let \(A\) and \(B\) be some events in \(\Omega\)
Conditional probability, as the wording implies, means the probability of something happening given something else has happened. Now, note “something” here makes reference to an event.
\[P(A|B)\]
It reads the probability of \(A\), given \(B\).
Note that if we think in terms of sets, by saying given \(B\) we immediately exclude everything that could have happened if \(B\) had not happened, and therefore our universal set is no longer \(\Omega\), but \(B\).
What we are looking for is, among the outcomes that live in \(B\), how many of those also live in \(A\) (because those would trigger event \(A\)). Actually, we are interested in the relative measure of those outcomes, compared to the whole size of \(B\): \[P(A|B)=\frac{P(A\cap B)}{P(B)}\]
Consider a factory that makes 10 wrenches 🔧. Among those, we know that 2 have imperfections. Suppose you intend to remove, randomly, 2 🔧 from the lot (of 10). Consider the following events:
\(A = \{\text{The first 🔧 is faulty}\}\) \(B = \{\text{The second 🔧 is faulty}\}\)
What if we want to compute \(P(B)\)? For a correct assessment of \(B\), we had better have some information on the realization of \(A\)!
If the first 🔧 was faulty, then \(A\) happened. If the first 🔧 was ok, then \(A^c\) happened, and therefore we can compute \(P(B|A)\) and \(P(B|A^c)\). We are assuming that we are removing these 🔧 without replacing them.
Let’s see why this last detail (replacing the 🔧) is so relevant before going on.
Initial set:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
Remove one at random (so you do not know which one you are getting), but after you remove it you can see what happened; let’s take out number 7.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 |  | 🔧 | 🔧 | 🔧 |
We observe, and voilà, it was a fine wrench 🔧. If we replace it, though, we would be picking from:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
That is, exactly the same conditions under which we made our first choice, and therefore what happens with the first pick is irrelevant: these events are now independent!
With \(A\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 8 🔧.
With \(A^c\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 8 🔧.
The probability of getting a 💥 is the same in each scenario! \(P(B|A)= P(B|A^c)\)!
\[P(B|A)=\frac{2}{10}=0.2\ \text{and}\ P(B|A^c)=\frac{2}{10}=0.2\]
Initial set:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
Remove one at random (you do not know which ones are broken).
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
🔧 | 🔧 | 💥 | 🔧 |  | 🔧 | 🔧 | 🔧 | 🔧 | 🔧 |
We observe, and voilà, it was a broken wrench 💥: \(A\) happened!
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
🔧 | 🔧 | 💥 | 🔧 | 💥 | 🔧 | 🔧 | 🔧 | 🔧 |  |
We observe, and voilà, it was a fine wrench 🔧: \(A^c\) happened!
With \(A\), we know that when we pick the second wrench (\(B\)), in the box there are 1 💥 and 9 🔧.
With \(A^c\), we know that when we pick the second wrench (\(B\)), in the box there are 2 💥 and 9 🔧.
The probability of getting a 💥 is different in each scenario! \(P(B|A)\neq P(B|A^c)\)!
\[P(B|A)=\frac{1}{9}=0.111\ \text{and}\ P(B|A^c)=\frac{2}{9}=0.222\]
So formally
Definition
Let \(A,B\subset\Omega\); then we say the probability of \(A\) given \(B\) is the conditional probability: \[P(A|B)=\frac{P(A\cap B)}{P(B)}\] Note that from here we can also obtain \(P(A\cap B)=P(A|B)P(B)\). In both cases we need \(P(B)\neq 0\).
It follows that the identity \[P(B|A)=\frac{P(A\cap B)}{P(A)}\] or \[P(A\cap B)=P(B|A)P(A)\] with \(P(A)\neq 0\) also holds true.
Now let’s think about \(P(A\cap B \cap C)\). Applying the definition of conditional probability twice:
\[P(A\cap B\cap C)=P(C|A\cap B)P(A\cap B)=P(C|A\cap B)P(B|A)P(A)\]
Note that, given the commutativity of the intersection, we could also have obtained, for example:
\[P(A\cap B\cap C)=P(A|B\cap C)P(C|B)P(B)\]
And to make sense of all of this we need \(P(X)>0\) and \(P(X\cap Y)>0\) with \(X,Y\in\{A,B,C\}\).
Consider a region with 1,000 adults. Their job data is captured by the following table:
Employed | Unemployed | Total | |
---|---|---|---|
Women | 470 | 55 | 525 |
Men | 430 | 45 | 475 |
Total | 900 | 100 | 1,000 |
Let’s define the events:
\(W=\{Woman\}\), \(M=\{Man\}\), \(U=\{Unemployed\}\)
\(P(U|W)=\dfrac{P(U\cap W)}{P(W)}=\dfrac{0.055}{0.525}\approx 0.105\)
\(P(W|U)=\dfrac{P(W\cap U)}{P(U)}=\dfrac{0.055}{0.1}=0.55\)
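These conditional probabilities can be read straight off the contingency table. A minimal sketch with the counts from the table:

```python
from fractions import Fraction

# Counts from the employment table: (sex, status) -> number of adults.
counts = {("W", "E"): 470, ("W", "U"): 55,
          ("M", "E"): 430, ("M", "U"): 45}
total = sum(counts.values())   # 1,000 adults

P_W = Fraction(counts[("W", "E")] + counts[("W", "U")], total)   # 525/1000
P_U = Fraction(counts[("W", "U")] + counts[("M", "U")], total)   # 100/1000
P_W_and_U = Fraction(counts[("W", "U")], total)                  # 55/1000

print(P_W_and_U / P_W)   # P(U|W) = 55/525 ≈ 0.105
print(P_W_and_U / P_U)   # P(W|U) = 55/100 = 0.55
```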
Definition
Two events \(A,B\subset\Omega\) are probabilistically independent if and only if: \[P(A\cap B)=P(A)P(B)\]
From the definition of independence, we can obtain several properties. Let \(A\) and \(B\) be independent events with \(P(A)P(B)>0\):
Are \(W\) and \(U\) from the previous example independent?
\(P(W)=0.525\), \(P(U)=0.1\), \(P(W|U)=0.55\), \(P(U|W)=0.105\).
Note that \(P(W)\neq P(W|U)\) and \(P(U)\neq P(U|W)\). Therefore, they cannot be independent.
Consider a die 🎲 that is thrown twice. Consider the following two events:
\(A=\{\text{The die shows an odd number the first time}\}\) \(B=\{\text{The die shows a number }>4\text{ the second time}\}\)
Are \(A\) and \(B\) independent events?
In this case, \(\Omega=\{(x,y)\in \mathbb{N}^2| x,y \leq 6\}\) with \(\# \Omega = 6^2=36\)
\(P(A)=\frac{18}{36}=\frac{1}{2}\)
\(P(B)=\frac{12}{36}=\frac{1}{3}\)
\(P(A\cap B)=\frac{1}{6}=P(A)P(B)\)
\(P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{1}{2}=P(A)\)
\(P(B|A)=\frac{P(B\cap A)}{P(A)}=\frac{1}{3}=P(B)\)
They are independent events!
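The two-dice independence check above can be verified by enumerating all 36 outcomes. A minimal sketch:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first throw, second throw).
Omega = list(product(range(1, 7), repeat=2))

A = {(x, y) for (x, y) in Omega if x % 2 == 1}   # odd on the first throw
B = {(x, y) for (x, y) in Omega if y > 4}        # > 4 on the second throw

def P(event):
    """Classical probability over the equally likely outcomes."""
    return Fraction(len(event), len(Omega))

print(P(A), P(B), P(A & B))      # 1/2 1/3 1/6
print(P(A & B) == P(A) * P(B))   # True: A and B are independent
```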
Two events being independent is not the same as them being incompatible:
\(A\) and \(B\) independent | \(A\) and \(B\) incompatible |
---|---|
\(P(A\cap B)=P(A)P(B)\) | \(P(A\cap B)=0\) |
\(P(A|B)=P(A)\) and \(P(B|A)=P(B)\) | \(P(A|B)=0\) and \(P(B|A)=0\) |
Let \(A\) and \(B\) be two events such that: \(P(A)=0.6\), \(P(B)=t\), and \(P(A\cup B)=0.8\)
Find \(t\) such that \(A\) and \(B\) are:
Incompatible: then \(P(A\cap B)=0\). From probability theory, we have \[P(A\cup B)=P(A)+P(B)-P(A\cap B)\] and therefore we get: \[0.8 = 0.6+t-0\Rightarrow t=0.2\]
Independent: then \(P(A\cap B)=P(A)P(B)=0.6t\). Using the same identity we just used: \[0.8=0.6+t-0.6t\Rightarrow t=0.5\]
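Both answers can be checked against the union identity \(P(A\cup B)=P(A)+P(B)-P(A\cap B)\). A minimal sketch:

```python
P_A, P_union = 0.6, 0.8

# (a) Incompatible: P(A∩B) = 0.
t_incomp = P_union - P_A                    # ≈ 0.2
assert abs(P_A + t_incomp - 0 - P_union) < 1e-9

# (b) Independent: P(A∩B) = P(A)·t, so 0.8 = 0.6 + t − 0.6t.
t_indep = (P_union - P_A) / (1 - P_A)       # ≈ 0.5
assert abs(P_A + t_indep - P_A * t_indep - P_union) < 1e-9

print(t_incomp, t_indep)
```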
Theorem
Let \(\{A_i\}_{i=1}^n\) be a partition of \(\Omega\), or \(\{A_i\}\in \mathcal{P}(\Omega)\). Then, for any \(B\subset\Omega\), it holds that:
\[P(B)=\sum_{i=1}^n P(A_i\cap B)=\sum_{i=1}^n P(B|A_i)P(A_i)\]
Consider a financial institution that sells two products, \(\alpha\) and \(\beta\), with very high yields. It is known that, among its clients, 10% invest a share of their wealth in \(\alpha\) and the rest in \(\beta\). From those who invest in \(\alpha\), 70% manage to get returns above the market. From among those who do not invest in \(\alpha\), 55% get returns above the market. Randomly choosing a client of this firm, find the probability this customer gets a return above the market.
Let’s define the events:
Matching with the available data we obtain:
\(P(A_1)=0.1\), \(P(A_2)=0.9\), \(P(B|A_1)=0.7\), and \(P(B|A_2)=0.55\).
From the Law of Total Probability:
\[P(B)=\sum_{i=1}^2 P(A_i\cap B)\] \[P(B)=P(B|A_1)P(A_1)+P(B|A_2)P(A_2)\] \[P(B)=0.7\times 0.1 + 0.55\times 0.9 = 0.565\]
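The computation above is just a weighted sum of the conditional probabilities. A minimal sketch (the dictionary keys `A1`, `A2` are labels chosen here for the two events):

```python
# Law of Total Probability for the two-product example:
# P(B) = P(B|A1)P(A1) + P(B|A2)P(A2)
priors = {"A1": 0.1, "A2": 0.9}          # invest in alpha / in beta
likelihoods = {"A1": 0.7, "A2": 0.55}    # P(above-market return | A_i)

P_B = sum(likelihoods[a] * priors[a] for a in priors)
print(round(P_B, 3))   # 0.565
```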
Bayes Theorem
Let the events \(A_1\), \(A_2\), … , \(A_n\), with \(n\in\mathbb{N}\), form a partition of \(\Omega\). Then, for any event \(B\subset\Omega\) with \(P(B)>0\):
\[P(A_i|B)=\frac{P(A_i\cap B)}{P(B)}=\frac{P(B|A_i)P(A_i)}{\sum_{j=1}^n P(B|A_j)P(A_j)}\] with \(i=1,2,...,n\)
Note that this is a consequence of the Law of Total Probability.
Note also that \(\sum_i P(A_i)=1\) and \(\sum_{i} P(A_i|B)=1\).
Bayes Theorem has been widely used in economics, in biomedical sciences, and in the social sciences when looking for causality.
If event \(B\) represents a consequence and the events \(A_i\) the probable causes, Bayes Theorem allows us to assess the probability of each cause given the observed consequence, \(P(A_i|B)\).
Let’s go back to the previous example, about our investors.
Let’s compute the probability that the client invested their money in product \(\beta\), given that the client had returns above the market (event \(B\)).
If the customer invested in \(\beta\), then the event we are trying to assess is \(A_2\), conditional on event \(B\): \(P(A_2|B)\):
\[P(A_2|B)=\frac{P(A_2\cap B)}{P(B)}=\frac{P(A_2)\times P(B|A_2)}{\sum_i P(A_i)\times P(B|A_i)}\]
We knew from the previous exercise that \(P(B)=0.565\), and therefore we obtain:
\[P(A_2|B)=\frac{0.9\times 0.55}{0.565}=0.876\]
How do we interpret this?
The probability that the client invested in \(\beta\), given that he had a return above the market, is 0.876.
All these computations can be very easy with the help of the following table:
\(A_i\) | \(P(A_i)\) | \(P(B|A_i)\) | \(P(A_i)P(B|A_i)\) | \(P(A_i|B)\) |
---|---|---|---|---|
\(A_1\) | 0.1 | 0.7 | 0.07 | 0.124 |
\(A_2\) | 0.9 | 0.55 | 0.495 | 0.876 |
Total | 1 | | 0.565 | 1 |
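The whole table can be built in a few lines: multiply priors by likelihoods, normalise by \(P(B)\), and read off the posteriors. A minimal sketch (the labels `A1`, `A2` are chosen here for the two events):

```python
priors = {"A1": 0.1, "A2": 0.9}          # P(A_i)
likelihoods = {"A1": 0.7, "A2": 0.55}    # P(B|A_i)

# Bayes Theorem: P(A_i|B) = P(B|A_i)P(A_i) / P(B),
# with P(B) given by the Law of Total Probability.
joint = {a: likelihoods[a] * priors[a] for a in priors}
P_B = sum(joint.values())
posterior = {a: joint[a] / P_B for a in priors}

for a in priors:   # one row of the table per event A_i
    print(a, priors[a], likelihoods[a],
          round(joint[a], 3), round(posterior[a], 3))
```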
Let’s verify now whether the events \(A_1\) and \(B^c\) are independent or not!
According to the definition of independence, we need to check whether \(P(A_1\cap B^c)=P(A_1)P(B^c)\). We have \(P(A_1\cap B^c)=P(A_1)-P(A_1\cap B)=0.1-0.07=0.03\), while \(P(A_1)P(B^c)=0.1\times(1-0.565)=0.0435\). Since \(0.03\neq 0.0435\), the equality fails.
Then \(A_1\) and \(B^c\) are not independent.